Evaluating static models on RTEB
The group of researchers associated with the Massive Text Embedding Benchmark (MTEB) has released a new benchmark: the Retrieval Text Embedding Benchmark (RTEB). As you may know, MTEB ranks models on their ability to perform well at a variety of tasks in a zero-shot setting, and is meant to reflect how well your model transfers to new tasks. Ranking high on MTEB can make or break your model, so it has become something that people optimize for, and, as Goodhart put it: “when a measure becomes a target, it ceases to be a good measure”.
Comparing PCA and MRL for static models
Without dimensionality reduction, static models can be hundreds of megabytes in size. Choosing the right dimensionality-reduction technique can shrink them without sacrificing retrieval quality. I was always a huge fan of Principal Component Analysis (PCA) for making static models smaller: PCA is used in model2vec, was used in an older version of tokenlearn to post-process models, and is used in the newer version of tokenlearn to reduce the dimensionality of the teacher models.(1) Recently, however, I started experimenting with Matryoshka Representation Learning (MRL) for reducing dimensions, and found it to be superior, which surprised me. This blog post thus tries to answer the question: when should you use PCA, and when MRL? If one is better than the other, why? I discuss both techniques, why applying dimensionality reduction to static embeddings makes sense, and some options for future work.
(1) I am no longer a maintainer or owner of these projects, but added the functionality while I was still at Minish.
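To make the PCA route concrete, here is a minimal sketch of shrinking a static embedding matrix with scikit-learn. The matrix is random and the sizes are illustrative, not taken from the post:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for a static model's embedding table: vocab_size x dim.
embeddings = rng.normal(size=(30_000, 1024)).astype(np.float32)

pca = PCA(n_components=256)
reduced = pca.fit_transform(embeddings)  # shape: (30_000, 256)

# For a static model, the embedding table essentially *is* the model,
# so storage shrinks roughly in proportion to the dimensionality
# (1024 -> 256 here, i.e. about 4x smaller).
print(reduced.shape, pca.explained_variance_ratio_.sum())
```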
Static late interaction models
Late interaction is an interesting paradigm for computing the similarity between two documents, and can be seen as a hybrid of sparse and dense retrieval. In this post, I will show how static models in a late interaction setting actually reduce to sparse models. I will also argue that, in the absence of empirical evidence to the contrary, there’s no good reason to assume that static late interaction models will be much better than their dense counterparts. But first, let’s dive into some fundamentals: I’ll explain what sparse retrieval and dense retrieval are, and how late interaction fits in with both paradigms.
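For reference, here is a minimal sketch of the late interaction (MaxSim) score used in ColBERT-style models, assuming pre-computed, L2-normalized token embeddings; the shapes are illustrative:

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token, take its
    best-matching document token, then sum over the query tokens."""
    sims = query_tokens @ doc_tokens.T     # (n_query, n_doc) cosine sims
    return float(sims.max(axis=1).sum())   # best doc token per query token

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 64));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(40, 64)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))
```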
Better Greedy Tokenizers: Handling WordPiece's [UNK] Problem
In a previous post, I showed that making a tokenizer greedy, that is, always picking the longest matching subword like WordPiece does, can improve results without retraining. But WordPiece can unfortunately silently break your tokenization.
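To illustrate the failure mode, here is a toy WordPiece-style max-match with a made-up vocabulary: if any part of a word fails to match, WordPiece throws away the pieces it already found and emits `[UNK]` for the whole word:

```python
VOCAB = {"snow", "##board", "##ing"}  # made-up vocabulary for illustration

def wordpiece(word: str, unk: str = "[UNK]") -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        if end == start:  # no subword matched: discard all progress
            return [unk]
        start = end
    return pieces

print(wordpiece("snowboarding"))  # ['snow', '##board', '##ing']
print(wordpiece("snowball"))      # ['[UNK]'] -- "snow" matched, "ball" didn't
```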
Note: alternative to regex splitting in byte tokenizers
In a previous note, I discussed an alternative to setting `split` to true in a `ByteLevel` pretokenizer. I suggested using a `ByteLevel` normalizer first, and then splitting using a complicated regex in “byte space”. However, this turned out not to work very well: there are certain character classes in the original regex, such as `\s`, that are very difficult to convert to a pattern in byte space.
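To see why, consider GPT-2’s byte-to-unicode mapping, which is what the `ByteLevel` components use: whitespace bytes get shifted to code points like Ġ, which `\s` simply doesn’t match. A small sketch:

```python
import re

def bytes_to_unicode() -> dict[int, str]:
    # GPT-2's mapping: printable bytes map to themselves, the rest are
    # shifted into unused code points starting at 256.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

table = bytes_to_unicode()
byte_space = "".join(table[b] for b in "hello world".encode("utf-8"))
print(byte_space)                    # helloĠworld -- the space is now Ġ
print(re.search(r"\s", byte_space))  # None: \s matches nothing in byte space
```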
Turning any tokenizer into a greedy one
I recently re-read Greed is All You Need: An Evaluation of Tokenizer Inference Methods. In this paper, the authors show that switching out inference methods for tokenizers can improve performance on various tasks.
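As a sketch of what greedy inference looks like, here is longest-match-first tokenization over an existing vocabulary. In practice the vocabulary would come from a trained tokenizer (e.g. `tokenizer.get_vocab()`); here it is a toy set:

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):  # longest candidate first
            if text[start:end] in vocab:
                tokens.append(text[start:end])
                start = end
                break
        else:  # nothing matched: fall back to a single character
            tokens.append(text[start])
            start += 1
    return tokens

print(greedy_tokenize("unbelievable", {"un", "believ", "believable", "able"}))
# ['un', 'believable'] -- the longest match wins, no retraining needed
```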
Tokenizer decasing
In this post I will talk about something I call tokenizer *decasing*. Decasing is very similar to putting a lowercase normalizer in front of a tokenizer, but works better.
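For reference, the baseline mentioned above, a lowercase normalizer in front of an existing tokenizer, looks roughly like this with the `tokenizers` library (the model name is just an example):

```python
from tokenizers import Tokenizer
from tokenizers.normalizers import Lowercase, Sequence

tok = Tokenizer.from_pretrained("bert-base-cased")  # example model

# Prepend lowercasing to whatever normalization the tokenizer already does.
existing = tok.normalizer
tok.normalizer = Sequence([Lowercase(), existing]) if existing else Lowercase()

print(tok.encode("Hello World").tokens)
```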
Using overload to handle tagged union return types
Here’s a function with an idiom I’ve seen a lot (probably copied from `sentence-transformers`):
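The function itself isn’t reproduced in this excerpt, but the idiom typically looks like the following sketch (the names are hypothetical): a boolean flag decides the return type, and `typing.overload` plus `Literal` lets a type checker narrow it:

```python
from typing import Literal, Union, overload

import numpy as np
import torch

# Hypothetical encode-style function; the return type depends on a flag,
# much like `encode` in sentence-transformers.
@overload
def encode(text: str, convert_to_numpy: Literal[True] = ...) -> np.ndarray: ...
@overload
def encode(text: str, convert_to_numpy: Literal[False]) -> torch.Tensor: ...
def encode(text: str, convert_to_numpy: bool = True) -> Union[np.ndarray, torch.Tensor]:
    embedding = torch.randn(8)  # stand-in for a real forward pass
    return embedding.numpy() if convert_to_numpy else embedding

# Under a type checker such as mypy or pyright:
#   encode("hi")                          -> np.ndarray (not a union)
#   encode("hi", convert_to_numpy=False)  -> torch.Tensor
```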